Motivated by practical applications, chiefly clinical trials, we study
the regret achievable for stochastic bandits under the constraint that 
the employed policy must split trials into a small number of batches. 

We propose a simple policy, and show that a very small number of 
batches gives close to minimax optimal regret bounds. As a byproduct,
we derive optimal policies with low switching cost for stochastic
bandits. 

1. Introduction. All clinical trials are run in batches: groups of patients 
are treated simultaneously, with the data from each batch influencing the 
design of the next. This structure arises as it is impractical to measure
outcomes (rewards) for each patient before deciding what to do next. Despite
the fact that this system is codified into law for drug approval, it has
received scant attention from statisticians. What can be achieved with a small
number of batches? How big should these batches be? How should results 
in one batch affect the structure of the next? 

We address these questions using the multi-armed bandit framework. This 
encapsulates an “exploration vs. exploitation” dilemma fundamental to
ethical clinical research [30, 34]. In the basic problem, there are two populations
of patients (or arms), corresponding to different treatments. At each point
in time $t = 1, \ldots, T$, a decision maker chooses to sample one, and receives a
random reward dictated by the efficacy of the treatment. The objective is
to devise a series of choices (a policy) maximizing the expected cumulative
reward over $T$ rounds. There is thus a clear tradeoff between discovering
which treatment is the most effective (exploration) and administering
the best treatment to as many patients as possible (exploitation).

The importance of batching extends beyond clinical trials. The bandit
framework has been used to study problems in economics, finance,
chemical engineering, scheduling, marketing and, more recently, internet
advertising. This last application has been the driving force behind a
surge of interest in many variations of bandit problems over the past decade.
Yet, even in internet advertising, technical constraints often force data to be
considered in batches, although the size of these batches is usually based on
technical convenience rather than on statistical reasoning. Discovering the
optimal structure, size and number of batches has applications in marketing 
[8, 31] and simulations [14]. 

In clinical trials, batches may be formal—the different phases required 
for approval of a new drug by the US Food and Drug Administration—or 
informal—with a pilot, a full trial, and then diffusion to the full population 
that may benefit. In an informal setup, the second step may be skipped if the 
pilot is successful enough. In this three-stage approach, the first, and usually 
second, phases focus on exploration, while the third focuses on exploitation. 
This is in stark contrast to the basic bandit problem described above, which 
effectively consists of T batches, each containing a single patient. 

We describe a policy that performs well with a small fixed number of 
batches. A fixed number of batches reflects clinical practice, but presents 
mathematical challenges. Nonetheless, we identify batch sizes that lead to
minimax regret bounds as low as those of the best non-batched algorithms. We
further show that these batch sizes perform well empirically. Together, these 
features suggest that near-optimal policies could be implemented with only 
small changes to current clinical practice. 

2. Description of the problem. 

2.1. Notation. For any positive integer $n$, define $[n] = \{1, \ldots, n\}$, and for
any $n_1 < n_2$, $[n_1 : n_2] = \{n_1, \ldots, n_2\}$ and $(n_1 : n_2] = \{n_1 + 1, \ldots, n_2\}$. For any
positive number $x$, let $\lfloor x \rfloor$ denote the largest integer $n$ such that $n \le x$ and
let $\lfloor x \rfloor_2$ denote the largest even integer $m$ such that $m \le x$. Additionally, for
any real numbers $a$ and $b$, $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$. Further,
define $\overline{\log}(x) = 1 \vee (\log x)$. $\mathbf{1}(\cdot)$ denotes the indicator function.

If $\mathcal{I}, \mathcal{J}$ are closed intervals of $\mathbb{R}$, then $\mathcal{I} \prec \mathcal{J}$ if $x < y$ for all $x \in \mathcal{I}$, $y \in \mathcal{J}$.

Finally, for two sequences $(u_T)_T$, $(v_T)_T$, we write $u_T = O(v_T)$ or $u_T \lesssim v_T$
if there exists a constant $C > 0$ such that $|u_T| \le C|v_T|$ for any $T$. Moreover,
we write $u_T = \Theta(v_T)$ if $u_T = O(v_T)$ and $v_T = O(u_T)$.
2.2. Framework. We employ a two-armed bandit framework with horizon
$T \ge 2$. Central ideas and intuitions are well captured by this concise
framework. Extensions to $K$-armed bandit problems are mostly technical
(see, for instance, [28]).

At each time $t \in [T]$, the decision maker chooses an arm $i \in \{1, 2\}$ and
observes a reward that comes from a sequence of i.i.d. draws $Y_1^{(i)}, Y_2^{(i)}, \ldots$ from
some unknown distribution $\nu^{(i)}$ with expected value $\mu^{(i)}$. We assume that the
distributions $\nu^{(i)}$ are standardized sub-Gaussian, that is,
$\int e^{\lambda(x - \mu^{(i)})}\, \nu^{(i)}(dx) \le e^{\lambda^2/2}$ for all $\lambda \in \mathbb{R}$.
Note that these include Gaussian distributions with variance at most 1,
and distributions supported on an interval of length at most
2. Rescaling extends the framework to other variance parameters.

For any integer $M \in [2 : T]$, let $\mathcal{T} = \{t_1, \ldots, t_M\}$ be an ordered sequence,
or grid, of integers such that $1 < t_1 < \cdots < t_M = T$. It defines a partition
$\mathcal{S} = \{S_1, \ldots, S_M\}$ of $[T]$ where $S_1 = [1 : t_1]$ and $S_k = (t_{k-1} : t_k]$ for $k \in [2 : M]$.
The set $S_k$ is called the $k$th batch. An $M$-batch policy is a couple $(\mathcal{T}, \pi)$
where $\mathcal{T} = \{t_1, \ldots, t_M\}$ is a grid and $\pi = \{\pi_t, t = 1, \ldots, T\}$ is a sequence
of random variables $\pi_t \in \{1, 2\}$, indicating which arm to pull at each time
$t = 1, \ldots, T$, which depend only on observations from batches strictly prior to
the current one. Formally, for each $t \in [T]$, let $J(t) \in [M]$ be the index of the
current batch $S_{J(t)}$. Then, for $t \in S_{J(t)}$, $\pi_t$ can only depend on observations

$$\{Y_s^{(\pi_s)} : s \in S_1 \cup \cdots \cup S_{J(t)-1}\} = \{Y_s^{(\pi_s)} : s \le t_{J(t)-1}\}.$$

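Denote by $\star \in \{1, 2\}$ the optimal arm defined by $\mu^{(\star)} = \max_{i \in \{1,2\}} \mu^{(i)}$, by
$\dagger \in \{1, 2\}$ the suboptimal arm, and by $\Delta := \mu^{(\star)} - \mu^{(\dagger)} > 0$ the gap between
the optimal expected reward and the suboptimal expected reward.

The performance of a policy $\pi$ is measured by its (cumulative) regret at
time $T$

$$R_T = R_T(\pi) = T\mu^{(\star)} - \sum_{t=1}^{T} \mathbb{E}\, \mu^{(\pi_t)}.$$

Denoting by $T_i(t) = \sum_{s=1}^{t} \mathbf{1}(\pi_s = i)$, $i \in \{1, 2\}$, the number of times arm $i$ was
pulled before time $t \ge 2$, regret can be rewritten as $R_T = \Delta\, \mathbb{E}\, T_\dagger(T)$.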
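
The identity between regret and the number of suboptimal pulls can be illustrated with a minimal Python sketch. It evaluates $\Delta$ times the number of suboptimal pulls for one realized sequence of pulls; averaging this quantity over the randomness of the policy and the rewards gives $R_T$. The function name `realized_regret` and its arguments are illustrative assumptions, not notation from the paper.

```python
def realized_regret(pulls, mu):
    """Gap times the number of suboptimal pulls in one realized sequence.

    pulls: sequence of arm labels in {1, 2}.
    mu: pair (mu_1, mu_2) of expected rewards (assumed known, for illustration).
    """
    best = 1 if mu[0] >= mu[1] else 2
    gap = abs(mu[0] - mu[1])
    suboptimal_pulls = sum(1 for arm in pulls if arm != best)
    return gap * suboptimal_pulls
```

For instance, with expected rewards 0.6 and 0.5 and two pulls of the second arm, the realized regret is twice the gap.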

2.3. Previous results. Bandit problems are well understood in the case
where $M = T$, that is, when the decision maker can use all available data at
each time $t \in [T]$. Bounds on the cumulative regret $R_T$ for stochastic
multi-armed bandits come in two flavors: minimax or adaptive. Minimax bounds
hold uniformly in $\Delta$ over a suitable subset of the positive real line such as
the intervals $(0, 1)$ or even $(0, \infty)$. The first results of this kind are attributed
to Vogel [36, 37], who proved that $R_T = O(\sqrt{T})$ in the two-armed case (see
also [6, 20]).
Adaptive policies exhibit regret bounds that may be much smaller than
the order of $\sqrt{T}$ when $\Delta$ is large. Such bounds were proved in the seminal
paper of Lai and Robbins [25] in an asymptotic framework (see also [10]).
While leading to tight constants, this framework washes out the correct
dependency on $\Delta$ of the logarithmic terms. In fact, recent research [1-3, 28]
has revealed that $R_T = O(\Delta T \wedge \log(T\Delta^2)/\Delta)$.

Nonetheless, a systematic analysis of the batched case does not exist,
even though Ucb2 [2] and improved-Ucb [3] are implicitly $M$-batch
policies with $M = O(\log T)$. These algorithms achieve optimal adaptive bounds.
Thus, employing a batched policy is only a constraint when the number
of batches $M$ is much smaller than $\log T$, as is often the case in
clinical practice. Similarly, in the minimax framework, $M$-batch policies, with
$M = O(\log\log T)$, lead to the optimal regret bound (up to logarithmic terms)
of $O(\sqrt{T \log\log\log T})$ [11, 12]. The sub-logarithmic range $M \ll \log T$ is
essential in applications where $M$ is small and constant, like clinical trials. In
particular, we wish to bound the regret for small values of $M$, such as 2, 3
or 4.

2.4. Literature. This paper connects to two lines of work: batched
sequential estimation [17, 18, 21, 33] and multistage clinical trials. Somerville
[32] and Maurice [26] studied the two-batch bandit problem in a minimax
framework under a Gaussian assumption. They prove that an
“explore-then-commit” type policy has regret of order $T^{2/3}$ for any value of
the gap $\Delta$, a result we recover and extend (see Section 4.3).

Colton [15, 16] introduced a Bayesian perspective, initiating a long line of 
work (see [22] for a recent overview). Most of this work focuses on the case 
of two or three batches, with isolated exceptions [13, 22]. Typically, this work
claims the size of the first batch should be of order $\sqrt{T}$, which agrees with
our results, up to a logarithmic term (see Section 4.2).

Batched procedures have a long history in clinical trials (see, for instance, 
[23] and [5]). Usually, batches are of the same size, or of random size, with 
the latter case providing robustness. This literature also focuses on inference 
questions rather than cumulative regret. A notable exception provides an
ad hoc objective to optimize batch size but recovers the suboptimal $\sqrt{T}$ in the
case of two batches [4].

2.5. Outline. Section 3 introduces a general class of $M$-batch policies we
call explore-then-commit (etc) policies. These policies are close to clinical
practice within batches. The performance of generic etc policies is
detailed in Proposition 1, found in Section 3.3. In Section 4, we study several
instantiations of this generic policy and provide regret bounds with explicit,
and often drastic, dependency on the number of batches $M$. Indeed, in
Section 4.3, we describe a policy in which regret decreases doubly exponentially
fast with the number of batches.

Two of the instantiations provide adaptive and minimax types of bounds,
respectively. Specifically, we describe two $M$-batch policies, $\pi^1$ and $\pi^2$, that
enjoy the following bounds on the regret:

$$R_T(\pi^1) \lesssim \Big(\frac{T}{\log T}\Big)^{1/M} \frac{\log(T\Delta^2)}{\Delta},$$

$$R_T(\pi^2) \lesssim T^{1/(2 - 2^{1-M})} \log^{\alpha_M}\big(T^{1/(2^M - 1)}\big), \qquad \alpha_M \in [0, 1/4).$$

Note that the bound for $\pi^1$ corresponds to the optimal adaptive rate
$\log(T\Delta^2)/\Delta$ when $M = O(\log(T/\log(T)))$ and the bound for $\pi^2$ corresponds
to the optimal minimax rate $\sqrt{T}$ when $M = O(\log\log T)$. The latter is
entirely feasible in clinical settings. As a byproduct of our results, we show that
the adaptive optimal bounds can be obtained with a policy that switches
between arms fewer than $O(\log(T/\log(T)))$ times, while the optimal
minimax bounds only require $O(\log\log T)$ switches. Indeed, etc policies can be
adapted to switch at most once in each batch.

Section 5 then examines the lower bounds on regret of any M-batch policy, 
and shows that the policies identified are optimal, up to logarithmic terms, 
within the class of M-batch policies. Finally, in Section 6 we compare policies 
through simulations using both standard distributions and real data from a 
clinical trial, and show that the policies we identify perform well even with 
a very small number of batches. 


3. Explore-then-commit policies. In this section, we describe a simple 
structure that can be used to build policies: explore-then-commit (etc). 
This structure consists of pulling each arm the same number of times in 
each non-terminal batch, and checking after each batch whether, according 
to some statistical test, one arm dominates the other. If one dominates, 
then only that arm is pulled until T. If, at the beginning of the terminal 
batch, neither arm has been declared dominant, then the policy commits to 
the arm with the largest average past reward. This “go for broke” step is 
dictated by regret minimization: in the last batch exploration is pointless as 
the information it produces can never be used. 

Any policy built using this principle is completely characterized by two 
elements: the testing criterion and the sizes of the batches. 


3.1. Statistical test. We begin by describing the statistical test employed
before non-terminal batches. Denote by

$$\hat\mu_s^{(i)} = \frac{1}{s} \sum_{\ell=1}^{s} Y_\ell^{(i)}$$

the empirical mean after $s \ge 1$ pulls of arm $i$. This estimator allows for the
construction of a collection of upper and lower confidence bounds for $\mu^{(i)}$ of





PERCHET, RIGOLLET, CHASSANG AND SNOWBERG 


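the form

$$\hat\mu_s^{(i)} + B_s^{(i)} \qquad \text{and} \qquad \hat\mu_s^{(i)} - B_s^{(i)},$$

where $B_s^{(i)} = 2\sqrt{2\log(T/s)/s}$ (with the convention that $B_0^{(i)} = \infty$). It follows
from Lemma B.1 that for any $\tau \in [T]$,

$$\mathbb{P}\{\exists s \le \tau : \mu^{(i)} > \hat\mu_s^{(i)} + B_s^{(i)}\} \vee \mathbb{P}\{\exists s \le \tau : \mu^{(i)} < \hat\mu_s^{(i)} - B_s^{(i)}\} \le \frac{4\tau}{T}. \tag{1}$$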

These bounds enable us to design the following family of tests $\{\varphi_t\}_{t \in [T]}$
with values in $\{1, 2, \perp\}$ where $\perp$ indicates that the test was inconclusive.
This test is only implemented at times $t \in [T]$ at which each arm has been
pulled exactly $s = t/2$ times. However, for completeness, we define the test
at all times $t$. For $t \ge 1$, define

$$\varphi_t = \begin{cases} i \in \{1, 2\}, & \text{if } T_1(t) = T_2(t) = t/2 \text{ and } \hat\mu_{t/2}^{(i)} - B_{t/2}^{(i)} > \hat\mu_{t/2}^{(j)} + B_{t/2}^{(j)},\ j \ne i, \\ \perp, & \text{otherwise.} \end{cases}$$
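
As an illustration, the test can be sketched in a few lines of Python. This is a minimal sketch, assuming both arms have been pulled exactly $t/2$ times; the function names `confidence_radius` and `phi` are hypothetical, and `None` stands in for the inconclusive outcome $\perp$.

```python
import math

def confidence_radius(s, T):
    """B_s = 2 * sqrt(2 * log(T / s) / s), with the convention B_0 = infinity."""
    if s == 0:
        return math.inf
    return 2.0 * math.sqrt(2.0 * math.log(T / s) / s)

def phi(t, mean_1, mean_2, T):
    """Return 1 or 2 if that arm's lower bound exceeds the other's upper bound.

    mean_1, mean_2: empirical means after t/2 pulls of each arm (t even, t <= T).
    Returns None when the test is inconclusive.
    """
    b = confidence_radius(t // 2, T)
    if mean_1 - b > mean_2 + b:
        return 1
    if mean_2 - b > mean_1 + b:
        return 2
    return None
```

Because both arms have the same number of pulls, the two radii coincide, so the test declares a winner exactly when the empirical means differ by more than twice the common radius.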

The errors of such tests are controlled as follows. 


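Lemma 1. Let $S \subset [T]$ be a deterministic subset of even times such that
$T_1(t) = T_2(t) = t/2$, for $t \in S$. Partition $S$ into $S_- \cup S_+$, $S_- \prec S_+$, where

$$S_- = \Big\{t \in S : \Delta < 16\sqrt{\tfrac{\log(2T/t)}{t}}\Big\}, \qquad S_+ = \Big\{t \in S : \Delta \ge 16\sqrt{\tfrac{\log(2T/t)}{t}}\Big\}.$$

Let $\bar t$ denote the smallest element of $S_+$. Then

$$\text{(i)}\ \ \mathbb{P}(\varphi_{\bar t} \ne \star) \le \frac{4\bar t}{T} \qquad \text{and} \qquad \text{(ii)}\ \ \mathbb{P}(\exists t \in S_- : \varphi_t = \dagger) \le \frac{4\bar t}{T}.$$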

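Proof. Assume without loss of generality that $\star = 1$.

(i) By definition,

$$\{\varphi_t \ne 1\} = \{\hat\mu_{t/2}^{(1)} - B_{t/2}^{(1)} \le \hat\mu_{t/2}^{(2)} + B_{t/2}^{(2)}\} \subset \{E_t^1 \cup E_t^2 \cup E_t^3\},$$

where $E_t^1 = \{\mu^{(1)} \ge \hat\mu_{t/2}^{(1)} + B_{t/2}^{(1)}\}$, $E_t^2 = \{\mu^{(2)} \le \hat\mu_{t/2}^{(2)} - B_{t/2}^{(2)}\}$, and
$E_t^3 = \{\mu^{(1)} - \mu^{(2)} < 2B_{t/2}^{(1)} + 2B_{t/2}^{(2)}\}$. It follows from (1) that with $\tau = t/2$,
$\mathbb{P}(E_t^1) \vee \mathbb{P}(E_t^2) \le 2t/T$.

Finally, for any $t \in S_+$, in particular for $t = \bar t$, we have

$$E_t^3 \subset \Big\{\Delta < 16\sqrt{\tfrac{\log(2T/t)}{t}}\Big\} = \emptyset.$$

(ii) Focus on the case $t \in S_-$, where $\Delta < 16\sqrt{\log(2T/t)/t}$. Here,

$$\bigcup_{t \in S_-} \{\varphi_t = 2\} = \bigcup_{t \in S_-} \{\hat\mu_{t/2}^{(2)} - B_{t/2}^{(2)} > \hat\mu_{t/2}^{(1)} + B_{t/2}^{(1)}\} \subset \bigcup_{t \in S_-} \{E_t^1 \cup E_t^2 \cup F_t^3\},$$

where $E_t^1$, $E_t^2$ are defined above and $F_t^3 = \{\mu^{(1)} - \mu^{(2)} < 0\} = \emptyset$ as $\star = 1$. It
follows from (1) that, with $\tau = \bar t$,

$$\mathbb{P}\Big(\bigcup_{t \in S_-} E_t^1\Big) \vee \mathbb{P}\Big(\bigcup_{t \in S_-} E_t^2\Big) \le \frac{2\bar t}{T}.$$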

3.2. Go for broke. In the last batch, the etc structure will “go for broke”
by selecting the arm $i$ with the largest average. Formally, at time $t$, let
$\psi_t = i$ iff $\hat\mu_{T_i(t)}^{(i)} \ge \hat\mu_{T_j(t)}^{(j)}$, with ties broken arbitrarily. While this criterion
may select the suboptimal arm with higher probability than the statistical
test described in the previous subsection, it also increases the probability of
selecting the correct arm by eliminating inconclusive results. This statement
is formalized in the following lemma. The proof follows immediately from
Lemma B.1.

Lemma 2. Fix an even time $t \in [T]$, and assume that both arms have
been pulled $t/2$ times each (i.e., $T_i(t) = t/2$, for $i = 1, 2$). Going for broke
leads to a probability of error

$$\mathbb{P}(\psi_t \ne \star) \le \exp(-t\Delta^2/16).$$
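
To get a feel for the scale of this bound, the short Python sketch below inverts it: the bound is at most a target level $\delta$ as soon as $t \ge 16\log(1/\delta)/\Delta^2$. The function names and the example values of the gap and target level are illustrative assumptions.

```python
import math

def go_for_broke_error_bound(t, gap):
    """Upper bound exp(-t * gap**2 / 16) on the error probability in Lemma 2."""
    return math.exp(-t * gap ** 2 / 16.0)

def smallest_even_time(gap, target):
    """Smallest even t for which the Lemma 2 bound is at most `target`."""
    t = math.ceil(16.0 * math.log(1.0 / target) / gap ** 2)
    return t + (t % 2)
```

With a gap of 0.1 and a target of 0.05, this calls for several thousand pulls in total, which illustrates why the terminal decision should be based on as much exploration data as the grid allows.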

3.3. Explore-then-commit policy. In a batched process, an extra
constraint is that past observations can only be inspected at a specific set of
times $\mathcal{T} = \{t_1, \ldots, t_{M-1}\} \subset [T]$, called a grid.

The generic etc policy uses a deterministic grid $\mathcal{T}$ that is fixed
beforehand, and is described more formally in Figure 1. Informally, at each decision
time $t_1, \ldots, t_{M-2}$, the policy implements the statistical test. If one arm is
determined to be better than the other, it is pulled until $T$. If no arm is
declared best, then both arms are pulled the same number of times in the
next batch.

We denote by $\varepsilon_t \in \{1, 2\}$ the arm pulled at time $t \in [T]$, and employ an
external source of randomness to generate the variables $\varepsilon_t$. With $N$ an even
number, let $(\varepsilon_1, \ldots, \varepsilon_N)$ be uniformly distributed over the subset
$\mathcal{V}_N = \{v \in \{1, 2\}^N : \sum_i \mathbf{1}(v_i = 1) = N/2\}$.^4 This randomization has no effect on the
policy, and could easily be replaced by any other mechanism that pulls each
arm an equal number of times. For example, a mechanism that pulls one
arm for the first half of the batch, and the other for the second half, may be
used if switching costs are a concern.


^4 Odd numbers for the deadlines $t_i$ could be considered, at the cost of rounding problems
and complexity, by defining $\mathcal{V}_N = \{v \in \{1, 2\}^N : |\sum_i \mathbf{1}(v_i = 1) - \sum_i \mathbf{1}(v_i = 2)| \le 1\}$.
Input:

• Horizon: $T$.

• Number of batches: $M \in [2 : T]$.

• Grid: $\mathcal{T} = \{t_1, \ldots, t_{M-1}\} \subset [T]$, $t_0 = 0$, $t_M = T$, $|S_m| = t_m - t_{m-1}$ is
even for $m \in [M - 1]$.

Initialization:

• Let $\varepsilon^{[m]} = (\varepsilon_1^{[m]}, \ldots, \varepsilon_{|S_m|}^{[m]})$ be uniformly distributed over^a $\mathcal{V}_{|S_m|}$, for
$m \in [M]$.

• The index $\ell$ of the batch in which a best arm was identified is initialized
to $\ell = \circ$.

Policy:

1. For $t \in [1 : t_1]$, choose $\pi_t = \varepsilon_t^{[1]}$.

2. For $m \in [2 : M - 1]$:

(a) If $\ell \ne \circ$, then $\pi_t = \varphi_{t_\ell}$ for $t \in (t_{m-1} : t_m]$.

(b) Else, compute $\varphi_{t_{m-1}}$.

i. If $\varphi_{t_{m-1}} = \perp$, select an arm at random, that is, $\pi_t = \varepsilon_t^{[m]}$ for
$t \in (t_{m-1} : t_m]$.

ii. Else, $\ell = m - 1$ and $\pi_t = \varphi_{t_{m-1}}$ for $t \in (t_{m-1} : t_m]$.

3. For $t \in (t_{M-1} : T]$:

(a) If $\ell \ne \circ$, $\pi_t = \varphi_{t_\ell}$.

(b) Otherwise, go for broke, that is, $\pi_t = \psi_{t_{M-1}}$.


^a In the case where $|S_m|$ is not an even number, we use the general definition of
footnote 4 for $\mathcal{V}_{|S_m|}$.


Fig. 1. Generic explore-then-commit policy with grid $\mathcal{T}$.
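
The steps of Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the grid is assumed to consist of even decision times, the callback `draw(arm, rng)` that produces one reward is hypothetical, and the function returns only the number of pulls of each arm.

```python
import math
import random

def etc_policy(grid, T, draw, rng):
    """Run one pass of a generic explore-then-commit policy.

    grid: increasing even decision times t_1 < ... < t_{M-1} < T.
    draw(arm, rng): returns one reward for arm in {1, 2} (hypothetical callback).
    Returns a dict with the number of pulls of each arm.
    """
    sums = {1: 0.0, 2: 0.0}
    counts = {1: 0, 2: 0}
    committed = None
    start = 0
    for m, end in enumerate(list(grid) + [T]):
        last_batch = (m == len(grid))
        if committed is None and start > 0:
            s = counts[1]  # both arms have been pulled equally often so far
            mean = {i: sums[i] / s for i in (1, 2)}
            if last_batch:
                # Go for broke: commit to the larger empirical mean.
                committed = 1 if mean[1] >= mean[2] else 2
            else:
                b = 2.0 * math.sqrt(2.0 * math.log(T / s) / s)
                if mean[1] - b > mean[2] + b:
                    committed = 1
                elif mean[2] - b > mean[1] + b:
                    committed = 2
        size = end - start
        if committed is None:
            # Balanced exploration in random order within the batch.
            arms = [1] * (size // 2) + [2] * (size - size // 2)
            rng.shuffle(arms)
        else:
            arms = [committed] * size
        for arm in arms:
            counts[arm] += 1
            if committed is None:
                sums[arm] += draw(arm, rng)
        start = end
    return counts
```

Rewards are only drawn while the policy is still exploring, since observations made after commitment are never used by the policy.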


In the terminal batch $S_M$, if no arm was determined to be optimal in any
prior batch, the etc policy will go for broke by selecting the arm $i$ such that
$\hat\mu_{T_i(t_{M-1})}^{(i)} \ge \hat\mu_{T_j(t_{M-1})}^{(j)}$, with ties broken arbitrarily.

To describe the regret incurred by a generic etc policy, we introduce
extra notation. For any $\Delta \in (0, 1)$, let $\tau(\Delta) = T \wedge \vartheta(\Delta)$ where $\vartheta(\Delta)$ is the
smallest integer such that

$$\Delta \ge 16\sqrt{\frac{\log[2T/\vartheta(\Delta)]}{\vartheta(\Delta)}}.$$

Notice that the above definition implies that $\tau(\Delta) \ge 2$ and

$$\tau(\Delta) \le \frac{256}{\Delta^2}\, \overline{\log}\Big(\frac{T\Delta^2}{128}\Big). \tag{2}$$

The time $\tau(\Delta)$ is, up to a multiplicative constant, the theoretical time at
which the optimal arm will be declared better by the statistical test with
large enough probability. As $\Delta$ is unknown, the grid will not usually contain
this value. Thus, the relevant time is the first grid point at or after $\tau(\Delta)$:


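$$m(\Delta, \mathcal{T}) = \begin{cases} \min\{m \in \{1, \ldots, M - 1\} : t_m \ge \tau(\Delta)\}, & \text{if } \tau(\Delta) \le t_{M-1}, \\ M - 1, & \text{otherwise.} \end{cases} \tag{3}$$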
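
The quantities $\vartheta(\Delta)$, $\tau(\Delta)$ and $m(\Delta, \mathcal{T})$ can be computed numerically. The Python sketch below uses a plain linear scan, which is adequate for moderate horizons; the function names are illustrative assumptions, and the grid is passed as the list of decision times $t_1, \ldots, t_{M-1}$.

```python
import math

def vartheta(gap, T):
    """Smallest integer n >= 1 with gap >= 16 * sqrt(log(2T / n) / n)."""
    n = 1
    while n < 2 * T and gap < 16.0 * math.sqrt(math.log(2.0 * T / n) / n):
        n += 1
    return n

def tau(gap, T):
    """tau(gap) = min(T, vartheta(gap))."""
    return min(T, vartheta(gap, T))

def first_grid_index(gap, T, grid):
    """1-based index of the first grid time >= tau(gap), or M - 1 if there is none."""
    target = tau(gap, T)
    for m, t in enumerate(grid, start=1):
        if t >= target:
            return m
    return len(grid)
```

The scan is valid because $\log(2T/n)/n$ decreases in $n$ on the range considered, so the first integer satisfying the inequality is the smallest one.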


The first proposition gives an upper bound for the regret incurred by a
generic etc policy run with a given set of times $\mathcal{T} = \{t_1, \ldots, t_{M-1}\}$.


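Proposition 1. Fix the time horizon $T \in \mathbb{N}$, the number of batches
$M \in [2 : T]$, and the grid $\mathcal{T} = \{t_1, \ldots, t_{M-1}\} \subset [T]$ with $t_0 = 0$. For any
$\Delta \in [0, 1]$, the generic etc policy described in Figure 1 incurs regret bounded by

$$R_T(\Delta, \mathcal{T}) \le 9\Delta t_{m(\Delta, \mathcal{T})} + T\Delta e^{-t_{M-1}\Delta^2/16}\, \mathbf{1}(m(\Delta, \mathcal{T}) = M - 1). \tag{4}$$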

Proof. Denote $\bar m = m(\Delta, \mathcal{T})$. Note that $t_{\bar m}$ denotes the theoretical time
on the grid at which the statistical test will declare $\star$ to be (with high
probability) the better arm.

We first examine the case where $\bar m < M - 1$. Define the following events:

$$A_m = \bigcap_{n=1}^{m} \{\varphi_{t_n} = \perp\}, \qquad B_m = \{\varphi_{t_m} = \dagger\} \qquad \text{and} \qquad C_m = \{\varphi_{t_m} \ne \star\}.$$

Regret can be incurred in one of the following three manners:

(i) by exploring before time $t_{\bar m}$,

(ii) by choosing arm $\dagger$ before time $t_{\bar m}$: this happens on event $B_m$,

(iii) by not committing to the optimal arm $\star$ at the optimal time $t_{\bar m}$: this
happens on event $C_{\bar m}$.

Error (i) is unavoidable and may occur with probability close to one. It
corresponds to the exploration part of the policy and leads to an additional
term $t_{\bar m}\Delta/2$ in the regret. An error of the type (ii) or (iii) can lead to a regret
of at most $T\Delta$, so we need to ensure that they occur with low probability.
Therefore, the regret incurred by the policy is bounded as

$$R_T(\Delta, \mathcal{T}) \le \frac{t_{\bar m}\Delta}{2} + T\Delta\, \mathbb{E}\Big[\mathbf{1}\Big(\bigcup_{m=1}^{\bar m - 1} A_{m-1} \cap B_m\Big) + \mathbf{1}\big(B_{\bar m - 1}^{c} \cap C_{\bar m}\big)\Big], \tag{5}$$

with the convention that $A_0$ is the whole probability space.
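Next, observe that $\bar m$ is chosen such that

$$16\sqrt{\frac{\log(2T/t_{\bar m})}{t_{\bar m}}} \le \Delta < 16\sqrt{\frac{\log(2T/t_{\bar m - 1})}{t_{\bar m - 1}}}.$$

In particular, $t_{\bar m}$ plays the role of $\bar t$ in Lemma 1. Thus, using part (i) of
Lemma 1,

$$\mathbb{P}\big(B_{\bar m - 1}^{c} \cap C_{\bar m}\big) \le \frac{4 t_{\bar m}}{T}.$$

Moreover, using part (ii) of the same lemma,

$$\mathbb{P}\Big(\bigcup_{m=1}^{\bar m - 1} A_{m-1} \cap B_m\Big) \le \frac{4 t_{\bar m}}{T}.$$

Together with (5) this implies regret is bounded by $R_T(\Delta, \mathcal{T}) \le 9\Delta t_{\bar m}$.
In the case where $m(\Delta, \mathcal{T}) = M - 1$, Lemma 2 shows that the go for broke
test errs with probability at most $\exp(-t_{M-1}\Delta^2/16)$, which gives that

$$R_T(\Delta, \mathcal{T}) \le 9\Delta t_{m(\Delta, \mathcal{T})} + T\Delta e^{-t_{M-1}\Delta^2/16},$$

using the same arguments as before. □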


Proposition 1 helps choose a grid by showing how that choice reduces to 
an optimal discretization problem. 


4. Functionals, grids and bounds. The regret bound of Proposition 1
critically depends on the choice of the grid $\mathcal{T} = \{t_1, \ldots, t_{M-1}\} \subset [T]$. Ideally,
we would like to optimize the right-hand side of (4) with respect to the $t_m$'s.
For a fixed $\Delta$, this problem is easy, and it is enough to choose $M = 2$,
$t_1 \simeq \tau(\Delta)$ to obtain optimal regret bounds of the order $R^*(\Delta) = \log(T\Delta^2)/\Delta$. For
unknown $\Delta$, the problem is not well defined: as observed by [15, 16], it
consists in optimizing a function $R(\Delta, \mathcal{T})$ for all $\Delta$, and there is no choice that
is uniformly better than others. To overcome this limitation, we minimize
pre-specified real-valued functionals of $R(\cdot, \mathcal{T})$. The functionals we focus on
are:

$$F_{\mathrm{xs}}[R_T(\cdot, \mathcal{T})] = \sup_{\Delta \in [0,1]} \{R_T(\Delta, \mathcal{T}) - C R^*(\Delta)\},\ C > 0 \qquad \text{(excess regret)},$$

$$F_{\mathrm{cr}}[R_T(\cdot, \mathcal{T})] = \sup_{\Delta \in [0,1]} \frac{R_T(\Delta, \mathcal{T})}{R^*(\Delta)} \qquad \text{(competitive ratio)},$$

$$F_{\mathrm{mx}}[R_T(\cdot, \mathcal{T})] = \sup_{\Delta \in [0,1]} R_T(\Delta, \mathcal{T}) \qquad \text{(maximum)}.$$







BATCHED BANDITS 




Optimizing different functionals leads to different optimal grids. We
investigate the properties of these functionals and grids in the rest of this section.^5


4.1. Excess regret and the arithmetic grid. We begin with the simple grid
consisting of a uniform discretization of $[T]$. This is particularly prominent
in the group sequential testing literature [23]. As we will see, even in a
favorable setup, it yields poor regret bounds.

Assume, for simplicity, that $T = 2KM$ for some positive integer $K$, so
that the grid is defined by $t_m = mT/M$. In this case, the right-hand side
of (4) is bounded below by $\Delta t_1 = \Delta T/M$. For small $M$, this lower bound is
linear in $T\Delta$, which is a trivial bound on regret. To obtain a valid upper
bound, note that

$$t_{m(\Delta, \mathcal{T})} \le \tau(\Delta) + \frac{T}{M} \le \frac{256}{\Delta^2}\, \overline{\log}\Big(\frac{T\Delta^2}{128}\Big) + \frac{T}{M}.$$

Moreover, if $m(\Delta, \mathcal{T}) = M - 1$ then $\Delta$ is of the order of $\sqrt{1/T}$, thus,
$T\Delta \lesssim 1/\Delta$. Together with (4), this yields the following theorem.


Theorem 1. The etc policy implemented with the arithmetic grid
defined above ensures that, for any $\Delta \in [0, 1]$,

$$R_T(\Delta, \mathcal{T}) \lesssim \Big(\frac{1}{\Delta}\log(T\Delta^2) + \frac{T\Delta}{M}\Big) \wedge T\Delta.$$

The optimal rate is recovered if $M = T$. However, the arithmetic grid
leads to a bound on the excess regret of the order of $\Delta T$ when $T$ is large
and $M$ constant.

In Section 5, the bound of Theorem 1 is shown to be optimal for excess 
regret, up to logarithmic factors. Clearly, this criterion provides little useful 
guidance on how to attack the batched bandit problem when M is small. 
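
For concreteness, the arithmetic grid of this section and the lower bound $\Delta t_1 = \Delta T/M$ on the right-hand side of (4) can be sketched in Python. The function names are illustrative assumptions, and the sketch keeps the simplifying assumption $T = 2KM$.

```python
def arithmetic_grid(T, M):
    """Decision times t_m = m * T / M for m = 1, ..., M - 1, assuming T = 2*K*M."""
    if M < 2 or T % (2 * M) != 0:
        raise ValueError("this sketch assumes M >= 2 and T = 2*K*M for an integer K")
    return [m * T // M for m in range(1, M)]

def first_batch_term(T, M, gap):
    """gap * t_1 = gap * T / M, which is linear in T when M is held fixed."""
    return gap * arithmetic_grid(T, M)[0]
```

Doubling $T$ with $M$ fixed doubles this term, which is the linear growth that makes the arithmetic grid unattractive for small $M$.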


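4.2. Competitive ratio and the geometric grid. The geometric grid is
defined as $\mathcal{T} = \{t_1, \ldots, t_{M-1}\}$, where $t_m = \lfloor a^m \rfloor_2$ and $a \ge 2$ is a parameter to
be chosen later. To bound regret using (4), note that if $m(\Delta, \mathcal{T}) \le M - 2$,
then

$$R_T(\Delta, \mathcal{T}) \le 9\Delta a^{m(\Delta, \mathcal{T})} \le 9a\Delta\tau(\Delta) \le \frac{2304 a}{\Delta}\, \overline{\log}\Big(\frac{T\Delta^2}{128}\Big),$$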


and if $m(\Delta, \mathcal{T}) = M - 1$, then $\tau(\Delta) > t_{M-2}$. Then, (4), together with
Lemma B.2 yields

$$R_T(\Delta, \mathcal{T}) \le 9\Delta a^{M-1} + T\Delta e^{-a^{M-1}\Delta^2/32} \le \frac{2336 a}{\Delta}\, \overline{\log}\Big(\frac{T\Delta^2}{32}\Big)$$

for $a \ge 2(T/\log T)^{1/M} \ge 2$. We have proved the following theorem.

^5 One could also consider the Bayesian criterion $F_{\mathrm{by}}[R_T(\cdot, \mathcal{T})] = \int R_T(\Delta, \mathcal{T})\, d\pi(\Delta)$
where $\pi$ is a given prior distribution on $\Delta$, rather than on the expected rewards as in
the traditional Bayesian bandit literature [7].


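Theorem 2. The etc policy implemented with the geometric grid
defined above for the value $a := 2(T/\log T)^{1/M}$, when $M \le \log(T/(\log T))$, ensures
that, for any $\Delta \in [0, 1]$,

$$R_T(\Delta, \mathcal{T}) \lesssim \Big(\frac{T}{\log T}\Big)^{1/M} \frac{\log(T\Delta^2)}{\Delta} \wedge T\Delta.$$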

For a logarithmic number of batches, $M = O(\log T)$, the geometric grid
leads to the optimal regret bound

$$R_T(\Delta, \mathcal{T}) \lesssim \frac{\log(T\Delta^2)}{\Delta} \wedge T\Delta.$$

This bound shows that the geometric grid leads to a deterioration of the
regret bound by a factor $(T/\log(T))^{1/M}$, which can be interpreted as a
uniform bound on the competitive ratio. For example, for $M = 2$ and $\Delta = 1$,
this leads to the $\sqrt{T}$ regret bound observed in the Bayesian literature, which
is also optimal in the minimax sense. However, this minimax optimal bound
is not valid for all values of $\Delta$. Indeed, maximizing over $\Delta > 0$ yields

$$\sup_{\Delta} R_T(\mathcal{T}, \Delta) \lesssim T^{\frac{M+1}{2M}} \log^{\frac{M-1}{2M}}\big((T/\log(T))^{1/M}\big),$$

which yields the minimax rate $\sqrt{T}$ when $M \ge \log(T/\log(T))$, as expected
from prior results. The decay in $M$ can be made even faster if one focuses
on the maximum risk, by employing our “minimax grid.”
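
The geometric grid of this section is simple to compute. The Python sketch below uses the value $a = 2(T/\log T)^{1/M}$ and rounds each $a^m$ down to an even integer; the function names are illustrative assumptions, and no safeguard is included for choices of $M$ that push $a^{M-1}$ beyond the horizon.

```python
import math

def floor_even(x):
    """Largest even integer not exceeding x."""
    return 2 * math.floor(x / 2.0)

def geometric_grid(T, M):
    """Decision times t_m = floor_even(a**m), m = 1..M-1, with a = 2*(T/log T)**(1/M)."""
    a = 2.0 * (T / math.log(T)) ** (1.0 / M)
    return [floor_even(a ** m) for m in range(1, M)]
```

With this choice, batch sizes grow by roughly a factor $a$ from one batch to the next, so early batches are short and most of the horizon is left to the last batches.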


4.3. Maximum risk and the minimax grid. The objective of this grid
is to minimize the maximum risk, and to recover the classical
distribution-independent minimax bound in $\sqrt{T}$. The intuition behind this grid comes
from Proposition 1, in which $\Delta t_{m(\Delta, \mathcal{T})}$ is the most important term to control.
Consider a grid $\mathcal{T} = \{t_1, \ldots, t_{M-1}\}$, where the $t_m$'s are defined recursively as
$t_{m+1} = f(t_m)$ so that, by definition, $t_{m(\Delta, \mathcal{T})} \le f(\tau(\Delta) - 1)$. As we minimize
the maximum risk, $\Delta f(\tau(\Delta))$ should be the smallest possible term, and
constant with respect to $\Delta$. This is ensured by choosing $f(\tau(\Delta) - 1) = a/\Delta$
or, equivalently, by choosing $f(x) = a/\tau^{-1}(x + 1)$ for a suitable notion of
the inverse. This yields $\Delta t_{m(\Delta, \mathcal{T})} \le a$, so that the parameter $a$ is actually
a bound on the regret. This parameter also has to be large enough so that
the regret $T \sup_{\Delta} \Delta e^{-t_{M-1}\Delta^2/8} = 2T/\sqrt{e\, t_{M-1}}$ incurred in the go for broke
step is also of the order of $a$. The formal definition below uses not only this
delicate recurrence, but also takes care of rounding problems.

Let $u_1 = a$, for some $a > 0$ to be chosen later, and $u_j = f(u_{j-1})$ where

$$f(u) = a\sqrt{\frac{u}{\log(2T/u)}}$$

for all $j \in \{2, \ldots, M - 1\}$. The minimax grid $\mathcal{T} = \{t_1, \ldots, t_{M-1}\}$ has points
given by $t_m = \lfloor u_m \rfloor_2$, $m \in \{1, \ldots, M - 1\}$.

If $m(\Delta, \mathcal{T}) \le M - 2$, then it follows from (4) that $R_T(\Delta, \mathcal{T}) \le 9\Delta t_{m(\Delta, \mathcal{T})}$,
and as $\tau(\Delta)$ is the smallest integer such that $\Delta \ge 16a/f(\tau(\Delta))$, we have

$$\Delta t_{m(\Delta, \mathcal{T})} \le \Delta f(\tau(\Delta) - 1) \le 16a.$$

As discussed above, if $a$ is greater than $2\sqrt{2}\,T/(16\sqrt{e\, t_{M-1}})$, then the regret
is also bounded by $16a$ when $m(\Delta, \mathcal{T}) = M - 1$. Therefore, in both cases, the
regret is bounded by $16a$. Before finding an $a$ satisfying the above conditions,
note that it follows from Lemma B.3 that, as long as $15a^{S_{M-2}} \le 2T$,

$$t_{M-1} \ge \frac{u_{M-1}}{2} \ge \frac{a^{S_{M-2}}}{30 \log^{S_{M-3}/2}(2T/a^{S_{M-5}})},$$

with the notation $S_k := 2 - 2^{-k}$. Therefore, we need to choose $a$ such that

$$a^{S_{M-1}} \ge \sqrt{\frac{15}{16e}}\, T \log^{S_{M-3}/4}\Big(\frac{2T}{a^{S_{M-5}}}\Big) \qquad \text{and} \qquad 15a^{S_{M-2}} \le 2T.$$

It follows from Lemma B.4 that the choice

$$a := (2T)^{1/S_{M-1}} \log^{\frac{1}{4} - \frac{3}{4}\frac{1}{2^M - 1}}\big((2T)^{15/(2^M - 1)}\big)$$

ensures both conditions when $2^M \le \log(2T)/6$. We emphasize that

$$\log^{\frac{1}{4} - \frac{3}{4}\frac{1}{2^M - 1}}\big((2T)^{15/(2^M - 1)}\big) \le 2 \qquad \text{with } M = \lfloor \log_2(\log(2T)/6) \rfloor.$$

As a consequence, in order to get the optimal minimax rate of $\sqrt{T}$, one only
needs $\lfloor \log_2 \log(T) \rfloor$ batches. If more batches are available, then our policy
implicitly combines some of them. We have proved the following theorem.

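Theorem 3. The etc policy over the minimax grid with

$$a = (2T)^{1/(2 - 2^{1-M})} \log^{\frac{1}{4} - \frac{3}{4}\frac{1}{2^M - 1}}\big((2T)^{15/(2^M - 1)}\big)$$

ensures that, for any $M$ such that $2^M \le \log(2T)/6$,

$$\sup_{0 \le \Delta \le 1} R_T(\Delta, \mathcal{T}) \lesssim T^{1/(2 - 2^{1-M})} \log^{\frac{1}{4} - \frac{3}{4}\frac{1}{2^M - 1}}\big(T^{1/(2^M - 1)}\big),$$

which is minimax optimal, that is, $\sup_{\Delta} R_T(\Delta, \mathcal{T}) \lesssim \sqrt{T}$, for $M \ge \log_2 \log(T)$.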
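
The minimax grid can be computed directly from its recursive definition. The Python sketch below uses the value of $a$ from Theorem 3 together with $u_1 = a$ and $u_j = a\sqrt{u_{j-1}/\log(2T/u_{j-1})}$; the function names are illustrative assumptions. The check that raises an error when a grid point reaches the horizon is an addition of this sketch, a reminder that for small $T$ or large $M$ the recursion can leave $[T]$.

```python
import math

def floor_even(x):
    """Largest even integer not exceeding x."""
    return 2 * math.floor(x / 2.0)

def minimax_grid(T, M):
    """Decision times t_1, ..., t_{M-1} of the minimax grid (sketch)."""
    k = 2 ** M - 1
    s = 2.0 - 2.0 ** (1 - M)  # S_{M-1} = 2 - 2^{-(M-1)}
    log_term = math.log((2.0 * T) ** (15.0 / k))
    a = (2.0 * T) ** (1.0 / s) * log_term ** (0.25 - 0.75 / k)
    u, grid = a, []
    for _ in range(M - 1):
        if u >= T:
            # Safeguard added in this sketch, not part of the definition.
            raise ValueError("grid point beyond the horizon for this (T, M)")
        grid.append(floor_even(u))
        u = a * math.sqrt(u / math.log(2.0 * T / u))
    return grid
```

For $M = 2$ the logarithmic factor has exponent zero, so the single decision time is $(2T)^{2/3}$ rounded down to an even integer.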
Table 1

Regret and decision times of the ETC policy with the minimax grid for M = 2, 3, 4, 5. In
the table, $l_T = \log(T)$

M    $t_1 = \sup_{\Delta} R_T(\Delta, \mathcal{T})$    $t_2$                      $t_3$                       $t_4$
2    $T^{2/3}$
3    $T^{4/7} l_T^{1/7}$                $T^{6/7} l_T^{-2/7}$
4    $T^{8/15} l_T^{1/5}$               $T^{12/15} l_T^{-1/5}$      $T^{14/15} l_T^{-2/5}$
5    $T^{16/31} l_T^{7/31}$             $T^{24/31} l_T^{-5/31}$     $T^{28/31} l_T^{-11/31}$    $T^{30/31} l_T^{-14/31}$


Table 1 gives the regret bounds (without constant factors) and the
decision times of the etc policy with the minimax grid for $M = 2, 3, 4, 5$.
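
The exponents of $T$ and $\log T$ in such a table can be tracked exactly with rational arithmetic. The Python sketch below is a heuristic under stated assumptions: it ignores constant factors and treats every logarithm $\log(2T/u)$ in the recursion $u_j = a\sqrt{u_{j-1}/\log(2T/u_{j-1})}$ as being of order $\log T$. The function name `minimax_exponents` is hypothetical.

```python
from fractions import Fraction

def minimax_exponents(M):
    """Pairs (p, q) with t_m of order T**p * log(T)**q, for m = 1, ..., M - 1."""
    k = 2 ** M - 1
    # a is of order T**(2**(M-1)/k) * log(T)**(1/4 - (3/4)/k).
    a = (Fraction(2 ** (M - 1), k), Fraction(1, 4) - Fraction(3, 4) / k)
    rows = [a]
    for _ in range(2, M):
        p, q = rows[-1]
        # u_j = a * sqrt(u_{j-1} / log T): halve both exponents, then subtract 1/2 from q.
        rows.append((a[0] + p / 2, a[1] + q / 2 - Fraction(1, 2)))
    return rows
```

For $M = 4$ this gives the exponent pairs $(8/15, 1/5)$, $(12/15, -1/5)$ and $(14/15, -2/5)$, in reduced form.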

The ETC policy with the minimax grid can easily be adapted to have
only $O(\log\log T)$ switches, and yet still achieve regret of optimal order $\sqrt{T}$.
To do so, in each batch one arm should be pulled for the first half of the 
batch, and the other for the second half, leading to only one switch within 
the batch, until the policy commits to a single arm. To ensure that a switch 
does not occur between batches, the first arm pulled in a batch should be set 
to the last arm pulled in the previous batch, assuming that the policy has 
not yet committed. This strategy is relevant in applications such as labor 
economics and industrial policy, where switching from one arm to the other
may be expensive [24]. In this context, our policy compares favorably with
the best current policies constrained to have $\log_2 \log(T)$ switches, which lead
to a regret bound of order $\sqrt{T \log\log\log T}$ [11].
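
The low-switching allocation described above can be sketched in Python. The sketch covers only the exploration phase, before the policy commits; the function names and the example batch sizes are illustrative assumptions.

```python
def half_half_schedule(batch_sizes, first_arm=1):
    """Exploration pulls with one switch inside each batch and none between batches."""
    pulls = []
    arm = first_arm
    for size in batch_sizes:
        half = size // 2
        other = 3 - arm  # the other arm in {1, 2}
        pulls += [arm] * half + [other] * (size - half)
        arm = other  # the next batch starts with the arm pulled last
    return pulls

def count_switches(pulls):
    """Number of times consecutive pulls use different arms."""
    return sum(1 for x, y in zip(pulls, pulls[1:]) if x != y)
```

With even batch sizes, each arm receives half of every batch, and the number of switches during exploration equals the number of exploration batches.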

5. Lower bounds. In this section, we address the optimality of the regret 
bounds derived above for the specific functionals $F_{\mathrm{xs}}$, $F_{\mathrm{cr}}$ and $F_{\mathrm{mx}}$. The
results below do not merely characterize optimality (up to logarithmic terms) 
of the chosen grid within the class of etc policies, but also optimality of the 
final policy among the class of all M-batch policies. 

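Theorem 4. Fix $T \ge 2$ and $M \in [2 : T]$. Any $M$-batch policy $(\mathcal{T}, \pi)$
must satisfy the following lower bounds:

$$\sup_{\Delta \in (0,1]} \Big\{R_T(\Delta, \mathcal{T}) - \frac{1}{\Delta}\Big\} \gtrsim \frac{T}{M},$$

$$\sup_{\Delta \in (0,1]} \{\Delta R_T(\Delta, \mathcal{T})\} \gtrsim T^{1/M},$$

$$\sup_{\Delta \in (0,1]} \{R_T(\Delta, \mathcal{T})\} \gtrsim T^{1/(2 - 2^{1-M})}.$$
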
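Proof. Fix $\Delta_k = 1/\sqrt{t_k}$, $k = 1, \ldots, M$. Focusing first on excess risk, it
follows from Proposition A.1 that

$$\sup_{\Delta \in (0,1]} \Big\{R_T(\Delta, \mathcal{T}) - \frac{1}{\Delta}\Big\} \ge \max_{1 \le k \le M} \Big\{\sum_{j=1}^{M} \frac{\Delta_k t_j}{4} \exp(-t_{j-1}\Delta_k^2/2) - \frac{1}{\Delta_k}\Big\} \ge \max_{1 \le k \le M} \Big\{\frac{t_{k+1}}{4\sqrt{e\, t_k}} - \sqrt{t_k}\Big\}.$$

As $t_{k+1} \ge t_k$, the last quantity above is minimized if all the terms are of
order 1. This yields $t_{k+1} = t_k + a$, for some positive constant $a$. As $t_M = T$,
we get that $t_j \sim jT/M$, and taking $\Delta = 1$ yields

$$\sup_{\Delta \in (0,1]} \Big\{R_T(\Delta, \mathcal{T}) - \frac{1}{\Delta}\Big\} \ge \frac{t_1}{4} \gtrsim \frac{T}{M}.$$

Proposition A.1 also yields

$$\sup_{\Delta \in (0,1]} \{\Delta R_T(\Delta, \mathcal{T})\} \ge \max_{k} \sum_{j=1}^{M} \frac{\Delta_k^2 t_j}{4} \exp(-t_{j-1}\Delta_k^2/2) \ge \max_{k} \Big\{\frac{t_{k+1}}{4\sqrt{e}\, t_k}\Big\}.$$

Arguments similar to the ones for the excess regret above give the lower
bound for the competitive ratio. Finally,

$$\sup_{\Delta \in (0,1]} R_T(\Delta, \mathcal{T}) \ge \max_{k} \sum_{j=1}^{M} \frac{\Delta_k t_j}{4} \exp(-t_{j-1}\Delta_k^2/2) \ge \max_{k} \Big\{\frac{t_{k+1}}{4\sqrt{e\, t_k}}\Big\}$$

gives the lower bound for maximum risk. □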




6. Simulations. In this final section, we briefly compare, in simulations, 
the various policies (grids) introduced above. These are also compared with 
Ucb2 [2], which, as noted above, can be seen as an M = O(logT) batch 
trial. A more complete exploration can be found in [29]. 

The minimax and geometric grids perform well using an order of
magnitude fewer batches than Ucb2. The number of batches required for Ucb2
makes its use for medical trials functionally impossible. For example, a study
that examined STI status six months after an intervention in [27] would
require 1.5 years to run using minimax batch sizes, but Ucb2 would use as
many as 56 batches, meaning the study would take 28 years.

[Figure 2: four panels (Gaussian with variance = 1; Student’s t with two degrees
of freedom; Bernoulli; Poisson), each plotted against Number of Subjects
($T$, in thousands) for the Arithmetic, Geometric, Minimax and UCB2 policies.]

Fig. 2. Performance of policies with different distributions and $M = 5$. (For all
distributions $\mu^{(\dagger)} = 0.5$, and $\mu^{(\star)} = 0.5 + \Delta = 0.6$.)

Specific examples of performance can be found in Figure 2. This figure 
compares average regret produced by different policies and many values of 
the total sample, T. For each value of T in the figure, a sample is drawn, grids 
are computed based on M and T, the policy is implemented, and average 
regret is calculated based on the choices in the policy. This is repeated 100 
times for each value of T. 

The number of batches is set at M = 5 for all policies except Ucb2. 
Each panel considers one of four distributions: two continuous (Gaussian
and Student’s t-distribution) and two discrete (Bernoulli and Poisson). In
all cases, we set the difference between the arms at $\Delta = 0.1$.

A few patterns are immediately apparent. First, the arithmetic grid produces 
relatively constant average regret above a certain number of participants. 
The intuition is straightforward: when $T$ is large enough, the ETC 
policy will tend to commit after the first batch, as the first evaluation point 
will be greater than $\tau(\Delta)$. In the arithmetic grid, the size of this first batch 
is a constant proportion of the overall participant pool, so average regret 
will be constant when $T$ is large enough. 
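
The constant-proportion property of the arithmetic grid can be checked numerically. The sketch below assumes the arithmetic grid has end points $t_k = kT/M$ and, for contrast, a geometric grid with end points $\lfloor a^k \rfloor$ for $a = T^{1/M}$ and final point $T$. Rounding conventions may differ from those used elsewhere in the paper, and the minimax grid is left out because its definition is not restated here.

```python
import math


def arithmetic_grid(T, M):
    """End points t_k = k*T/M."""
    return [k * T // M for k in range(1, M + 1)]


def geometric_grid(T, M):
    """End points floor(a**k) with a = T**(1/M); the last point is set to T."""
    a = T ** (1.0 / M)
    # The small offset guards against floating-point results such as 9.999999.
    return [math.floor(a ** k + 1e-9) for k in range(1, M)] + [T]


def first_batch_fraction(grid, T):
    """Share of all participants enrolled in the first batch."""
    return grid[0] / T


if __name__ == "__main__":
    M = 5
    for T in (10_000, 50_000, 250_000):
        print(T,
              first_batch_fraction(arithmetic_grid(T, M), T),
              first_batch_fraction(geometric_grid(T, M), T))
```

Under these assumptions the arithmetic fraction stays at $1/M$ for every $T$, while the geometric fraction shrinks as $T$ grows.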

Second, the minimax grid also produces relatively constant average regret, 
although this holds for smaller values of $T$, and produces lower regret than 
the geometric or arithmetic case when $M$ is small. This indicates, using 
the intuition above, that the minimax grid excels at choosing the optimal 
batch size to allow a decision to commit very close to $\tau(\Delta)$. This advantage 
over the arithmetic and geometric grids is clear. The minimax grid can even 
produce lower regret than Ucb2, using an order of magnitude fewer batches. 

Third, and finally, the Ucb2 algorithm generally produces lower regret 
than any of the policies considered in this manuscript for all distributions 
except the heavy-tailed Student's t-distribution, for which batched policies 
perform significantly better. Indeed, Ucb2 is calibrated for sub-Gaussian 
rewards, as are batched policies. However, even with heavy-tailed distributions, 
the central limit theorem implies that batching a large number of 
observations returns averages that are sub-Gaussian; see the supplementary 
material [29]. Even when Ucb2 performs better, this increase in performance 
comes at a steep practical cost: many more batches. For example, 
with draws from a Gaussian distribution, and $T$ between 10,000 and 40,000, 
the minimax grid with only 5 batches performs better than Ucb2. Throughout 
this range, Ucb2 uses roughly 50 batches. 

It is worth noting that in medical trials, there is nothing special about 
waiting six months for data from an intervention. Trials of cancer drugs 
often measure variables like the 1- or 3-year survival rate, or the increase in 
average survival compared to a baseline that may be greater than a year. In 
these cases, the ability to get relatively low regret with a small number of 
batches is extremely important. 

APPENDIX A: TOOLS FOR LOWER BOUNDS 

Our results hinge on tools for lower bounds, recently adapted to the 
bandit setting in [9]. Specifically, we reduce the problem of deciding which 
arm to pull to that of hypothesis testing. Consider the following two candidate 
setups for the reward distributions: $P_1 = \mathcal{N}(\Delta, 1) \otimes \mathcal{N}(0, 1)$ and 
$P_2 = \mathcal{N}(0, 1) \otimes \mathcal{N}(\Delta, 1)$; that is, under $P_1$ successive pulls of arm 1 yield 
$\mathcal{N}(\Delta, 1)$ rewards and successive pulls of arm 2 yield $\mathcal{N}(0, 1)$ rewards. The 
opposite is true for $P_2$, so arm $i$ is optimal under $P_i$. 

At a given time $t \in [T]$, the choice of $\pi_t \in \{1, 2\}$ is a test between $P_1^t$ 
and $P_2^t$, where $P_i^t$ denotes the distribution of observations available at time 
$t$ under $P_i$. Let $R(t, \pi)$ denote the regret incurred by policy $\pi$ at time $t$. We 
have $R(t, \pi) = \Delta \mathbb{1}(\pi_t \neq i)$. Denote by $E_i^t$ the expectation under $P_i^t$, so that 

\[
E_1^t[R(t,\pi)] \vee E_2^t[R(t,\pi)] \ge \frac{1}{2}\bigl(E_1^t[R(t,\pi)] + E_2^t[R(t,\pi)]\bigr) 
= \frac{\Delta}{2}\bigl(P_1^t(\pi_t = 2) + P_2^t(\pi_t = 1)\bigr). 
\]

Next, we use the following lemma (see [35], Chapter 2). 

Lemma A.1. Let $P_1$ and $P_2$ be two probability distributions such that 
$P_1 \ll P_2$. Then for any measurable set $A$, 

\[ P_1(A) + P_2(A^c) \ge \frac{1}{2}\exp\bigl(-\mathrm{KL}(P_1, P_2)\bigr), \]

where $\mathrm{KL}(P_1, P_2)$ is the Kullback-Leibler divergence defined by 

\[ \mathrm{KL}(P_1, P_2) = \int \log\Bigl(\frac{dP_1}{dP_2}\Bigr)\, dP_1. \]
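
As a sanity check of Lemma A.1 in the Gaussian case used here, take a single observation with $P_1 = \mathcal{N}(\Delta, 1)$, $P_2 = \mathcal{N}(0, 1)$ and the set $A = \{x \le \Delta/2\}$, so that $\mathrm{KL}(P_1, P_2) = \Delta^2/2$. The sketch below compares the sum of the two error probabilities with the bound $\frac{1}{2}\exp(-\Delta^2/2)$. The choice of $A$ and the function names are illustrative assumptions.

```python
import math


def std_normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def error_sum(delta):
    """P1(A) + P2(A^c) for A = {x <= delta/2}, P1 = N(delta, 1), P2 = N(0, 1)."""
    p1_A = std_normal_cdf(delta / 2.0 - delta)     # P1 puts mass below delta/2
    p2_Ac = 1.0 - std_normal_cdf(delta / 2.0)      # P2 puts mass above delta/2
    return p1_A + p2_Ac


def kl_lower_bound(delta):
    """(1/2) * exp(-KL) with KL(N(delta, 1), N(0, 1)) = delta**2 / 2."""
    return 0.5 * math.exp(-delta ** 2 / 2.0)


if __name__ == "__main__":
    for delta in (0.1, 0.5, 1.0, 2.0, 3.0):
        print(delta, error_sum(delta), kl_lower_bound(delta))
```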

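Here, observations are generated by an $M$-batch policy $\pi$. Recall that 
$J(t) \in [M]$ denotes the index of the current batch. As $\pi_t$ depends only on 
observations from completed batches, that is, those indexed by $s \in [t_{J(t)-1}]$, 
$P_i^t$ is a product distribution of at most $t_{J(t)-1}$ marginals. It is straightforward 
to show that whatever arms are observed over the history, 
$\mathrm{KL}(P_1^t, P_2^t) = t_{J(t)-1}\Delta^2/2$. Therefore, 

\[ E_1^t[R(t,\pi)] \vee E_2^t[R(t,\pi)] \ge \frac{\Delta}{4}\exp\bigl(-t_{J(t)-1}\Delta^2/2\bigr). \]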

Summing over t yields the following result. 

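Proposition A.1. Fix $\mathcal{T} = \{t_1, \ldots, t_M\}$ and let $(\mathcal{T}, \pi)$ be an $M$-batch 
policy. There exist reward distributions with gap $\Delta$ such that $(\mathcal{T}, \pi)$ has 
regret bounded below as, defining $t_0 := 0$, 

\[ R_T(\Delta, \mathcal{T}) \ge \Delta \sum_{j=1}^{M} \frac{t_j}{4}\exp\bigl(-t_{j-1}\Delta^2/2\bigr). \]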

A variety of lower bounds in Section 5 are shown using this proposition. 
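
To see how the bound of Proposition A.1 behaves for a concrete grid, the sketch below evaluates $\Delta \sum_{j=1}^{M} (t_j/4)\exp(-t_{j-1}\Delta^2/2)$ with $t_0 = 0$. The grid and the gaps in the example are arbitrary illustrative values, and the function name is hypothetical.

```python
import math


def batch_lower_bound(grid, delta):
    """Evaluate delta * sum_j (t_j / 4) * exp(-t_{j-1} * delta**2 / 2), t_0 = 0."""
    total, previous = 0.0, 0
    for t in grid:
        total += (t / 4.0) * math.exp(-previous * delta ** 2 / 2.0)
        previous = t
    return delta * total


if __name__ == "__main__":
    grid = [200, 400, 600, 800, 1000]   # arithmetic grid for T = 1000, M = 5
    for delta in (0.01, 0.05, 0.1, 0.5):
        print(delta, batch_lower_bound(grid, delta))
```

With a single batch the exponential factor equals 1, so the bound reduces to $\Delta T/4$.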

APPENDIX B: TECHNICAL LEMMAS 

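A process $\{Z_t\}_{t \ge 0}$ is a sub-Gaussian martingale difference sequence if 
$E[Z_{t+1} \mid Z_1, \ldots, Z_t] = 0$ and $E[e^{\lambda Z_{t+1}}] \le e^{\lambda^2/2}$ for every $\lambda > 0$, $t \ge 0$. 

Lemma B.1. Let $Z_t$ be a sub-Gaussian martingale difference sequence and 
write $\bar Z_t = \frac{1}{t}\sum_{s=1}^{t} Z_s$. Then, for every $\delta > 0$ and every integer $t \ge 1$, 

\[ P\Bigl\{\bar Z_t \ge \sqrt{\tfrac{2}{t}\log\bigl(\tfrac{1}{\delta}\bigr)}\Bigr\} \le \delta. \]

Moreover, for every integer $\tau \ge 1$, 

\[ P\Bigl\{\exists\, t \le \tau,\ \bar Z_t \ge 2\sqrt{\tfrac{2}{t}\log\bigl(\tfrac{4}{\delta}\tfrac{\tau}{t}\bigr)}\Bigr\} \le \delta. \]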
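Proof. The first inequality follows from a classical Chernoff bound. 
To prove the maximal inequality, define $\varepsilon_t = 2\sqrt{\tfrac{2}{t}\log\bigl(\tfrac{4}{\delta}\tfrac{\tau}{t}\bigr)}$. Note that, by 
Jensen's inequality, for any $\alpha > 0$, the process $\{\exp(\alpha s \bar Z_s)\}_s$ is a 
submartingale. Therefore, it follows from Doob's maximal inequality [19], 
Theorem 3.2, page 314, that for every $\eta > 0$ and every integer $t \ge 1$, 

\[
P\{\exists\, s \le t,\ s\bar Z_s \ge \eta\} = P\{\exists\, s \le t,\ e^{\alpha s \bar Z_s} \ge e^{\alpha\eta}\} 
\le E[e^{\alpha t \bar Z_t}]\, e^{-\alpha\eta}. 
\]

Next, as $Z_t$ is sub-Gaussian, we have $E[\exp(\alpha t \bar Z_t)] \le \exp(\alpha^2 t/2)$. The 
above, and optimizing with respect to $\alpha > 0$, yields 

\[ P\{\exists\, s \le t,\ s\bar Z_s \ge \eta\} \le \exp\Bigl(-\frac{\eta^2}{2t}\Bigr). \]

Next, using a peeling argument, one obtains 

\[
\begin{aligned}
P\{\exists\, t \le \tau,\ \bar Z_t \ge \varepsilon_t\} 
&\le \sum_{m=0}^{\lfloor \log_2(\tau) \rfloor} P\Bigl\{\bigcup_{t=2^m}^{2^{m+1}-1} \{\bar Z_t \ge \varepsilon_t\}\Bigr\} \\
&\le \sum_{m=0}^{\lfloor \log_2(\tau) \rfloor} P\Bigl\{\bigcup_{t=2^m}^{2^{m+1}} \{\bar Z_t \ge \varepsilon_{2^{m+1}}\}\Bigr\} \\
&\le \sum_{m=0}^{\lfloor \log_2(\tau) \rfloor} P\Bigl\{\bigcup_{t=2^m}^{2^{m+1}} \{t\bar Z_t \ge 2^m \varepsilon_{2^{m+1}}\}\Bigr\} \\
&\le \sum_{m=0}^{\lfloor \log_2(\tau) \rfloor} \exp\Bigl(-\frac{(2^m \varepsilon_{2^{m+1}})^2}{2^{m+2}}\Bigr) \\
&= \sum_{m=0}^{\lfloor \log_2(\tau) \rfloor} \frac{2^{m+1}}{\tau}\,\frac{\delta}{4} 
\le \frac{2^{\log_2(\tau)+2}}{\tau}\,\frac{\delta}{4} 
\le \delta. 
\end{aligned}
\]

Hence, the result. □ 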
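
The last two steps of the peeling argument are pure arithmetic and can be verified directly. The sketch below assumes the threshold $\varepsilon_t = 2\sqrt{(2/t)\log((4/\delta)(\tau/t))}$, checks that each term $\exp(-(2^m\varepsilon_{2^{m+1}})^2/2^{m+2})$ equals $(2^{m+1}/\tau)(\delta/4)$, and confirms that the terms sum to at most $\delta$. Function names are illustrative.

```python
import math


def eps(t, tau, delta):
    """Threshold eps_t = 2 * sqrt((2 / t) * log((4 / delta) * (tau / t)))."""
    return 2.0 * math.sqrt((2.0 / t) * math.log((4.0 / delta) * (tau / t)))


def peeling_terms(tau, delta):
    """Terms exp(-(2^m * eps_{2^(m+1)})^2 / 2^(m+2)) for m = 0..floor(log2(tau))."""
    top = tau.bit_length() - 1   # floor(log2(tau)) for a positive integer tau
    terms = []
    for m in range(top + 1):
        e = eps(2 ** (m + 1), tau, delta)
        terms.append(math.exp(-((2 ** m) * e) ** 2 / 2 ** (m + 2)))
    return terms


if __name__ == "__main__":
    tau, delta = 1000, 0.05
    terms = peeling_terms(tau, delta)
    closed_form = [(2 ** (m + 1) / tau) * (delta / 4.0) for m in range(len(terms))]
    print(sum(terms), sum(closed_form), delta)
```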

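Lemma B.2. Fix two positive integers $T$ and $M \le \log(T)$. It holds that 

\[ T\Delta\, e^{-(a^{M-1}\Delta^2)/32} \le \frac{32a}{\Delta}\,\overline{\log}\Bigl(\frac{T\Delta^2}{32}\Bigr) \quad \text{if } a \ge \Bigl(\frac{MT}{\log T}\Bigr)^{1/M}, \]

where $\overline{\log}(x) = 1 \vee \log(x)$. 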
Proof. Fix the value of $a$ and observe that $M \le \log T$ implies that 
$a \ge e$. Define $x := T\Delta^2/32 > 0$ and $\theta := a^{M-1}/T > 0$. The first inequality is 
rewritten as 

\[ x e^{-\theta x} \le a\,\overline{\log}(x). \tag{7} \]

We will prove that this inequality is true for all $x > 0$, given that $\theta$ and $a$ 
satisfy some relation. This, in turn, gives a condition that depends solely on 
$a$, ensuring that the statement of the lemma is true for all $\Delta > 0$. 

Equation (7) immediately holds if $x \le e$, as then $\overline{\log}(x) = 1 \vee \log(x) = 1$ and 
$a\,\overline{\log}(x) = a \ge e$. Similarly, $x e^{-\theta x} \le 1/(\theta e)$. Thus (7) holds for all 
$x \ge 1/\sqrt{\theta}$ when $a \ge a^* := 1/(\theta\log(1/\theta))$. 
We assume this inequality holds. Thus, we must show that (7) holds for 
$x \in [e, 1/\sqrt{\theta}]$. For $x \le a$, the derivative of the right-hand side is $a/x \ge 1$, while 
the derivative of the left-hand side is smaller than 1. As a consequence, (7) 
holds for every $x \le a$, in particular for every $x \le a^*$. To summarize, whenever 
$a \ge a^*$, equation (7) holds on $(0, e]$, on $[e, a^*]$ and on $[1/\sqrt{\theta}, +\infty)$, thus on $(0, +\infty)$ 
as $a^* \ge 1/\sqrt{\theta}$. Next, if $a^M \ge MT/\log T$, we obtain 

\[ \frac{a}{a^*} = \frac{a^M}{T}\log\Bigl(\frac{T}{a^{M-1}}\Bigr) \ge \frac{1}{\log(T)}\log\Bigl(T\Bigl(\frac{\log(T)}{M}\Bigr)^{M-1}\Bigr). \]

The result follows from $\log(T)/M \ge 1$, hence $a/a^* \ge 1$. □ 
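
Inequality (7) of the proof, $x e^{-\theta x} \le a\,\overline{\log}(x)$ with $\theta = a^{M-1}/T$, can be probed numerically at the boundary value $a = (MT/\log T)^{1/M}$. The sketch below assumes $\overline{\log}(x) = \max(1, \log x)$ and scans a logarithmic grid of $x$ values. It is a spot check on a few $(T, M)$ pairs and not a substitute for the proof; the helper names are illustrative.

```python
import math


def log_bar(x):
    """Truncated logarithm max(1, log(x)), assumed to be the log used in (7)."""
    return max(1.0, math.log(x))


def worst_violation(T, M, num_points=2000):
    """Largest value of x*exp(-theta*x) - a*log_bar(x) over x in [1e-3, 1e9]."""
    a = (M * T / math.log(T)) ** (1.0 / M)   # boundary value of a in Lemma B.2
    theta = a ** (M - 1) / T
    worst = -math.inf
    for i in range(num_points):
        x = 10.0 ** (-3.0 + 12.0 * i / (num_points - 1))
        worst = max(worst, x * math.exp(-theta * x) - a * log_bar(x))
    return worst


if __name__ == "__main__":
    for T, M in ((10_000, 1), (10_000, 5), (10_000, 9), (1_000_000, 3)):
        print(T, M, worst_violation(T, M))   # non-positive means (7) held on the grid
```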


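Lemma B.3. Fix $a \ge 1$, $b \ge e$ and let $u_1, u_2, \ldots$ be defined by $u_1 = a$ and 
$u_{k+1} = a\sqrt{u_k/\log(b/u_k)}$. Define $S_k = 0$ for $k < 0$ and 

\[ S_k = \sum_{j=0}^{k} 2^{-j} = 2 - 2^{-k} \quad \text{for } k \ge 0. \]

Then, for any $M$ such that $15a^{S_{M-2}} \le b$, and all $k \in [M-3]$, 

\[ u_k \ge \frac{a^{S_{k-1}}}{15\log^{S_{k-2}/2}\bigl(b/a^{S_{k-2}}\bigr)}. \]

Moreover, for $k \in [M-2 : M]$, we also have 

\[ u_k \ge \frac{a^{S_{k-1}}}{15\log^{S_{k-2}/2}\bigl(b/a^{S_{M-5}}\bigr)}. \]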
Proof. Define $z_k = \log(b/a^{S_k})$. It is straightforward to show that $z_k \le 3z_{k+1}$ 
iff $a^{S_{k+2}} \le b$. In particular, $a^{S_{M-2}} \le b$ implies that $z_k \le 3z_{k+1}$ for all 
$k \in [0 : M-4]$. Next, we have 

\[ u_{k+1} = a\sqrt{\frac{u_k}{\log(b/u_k)}} \ge a\sqrt{\frac{a^{S_{k-1}}}{15\, z_{k-2}^{S_{k-2}/2}\log(b/u_k)}}. \tag{8} \]

Observe that $b/a^{S_{k-1}} \ge 15$, so for all $k \in [0, M-1]$ we have 

\[ \log(b/u_k) \le \log(b/a^{S_{k-1}}) + \log 15 + \frac{S_{k-2}}{2}\log z_{k-2} \le 5z_{k-1}. \]

This yields 

\[ z_{k-2}^{S_{k-2}/2}\log(b/u_k) \le 15\, z_{k-1}^{S_{k-1}}. \]

Plugging this bound into (8) completes the proof for $k \in [M-3]$. 
Finally, if $k \ge M-2$, we have by induction on $k$ from $M-3$, 

\[ u_{k+1} = a\sqrt{\frac{u_k}{\log(b/u_k)}} \ge a\sqrt{\frac{a^{S_{k-1}}}{15\, z_{M-5}^{S_{k-2}/2}\log(b/u_k)}}. \]

Moreover, as $b/a^{S_{k-1}} \ge 15$, for $k \in [M-3, M-1]$ we have 

\[ \log(b/u_k) \le \log(b/a^{S_{k-1}}) + \log 15 + \frac{S_{k-2}}{2}\log z_{M-5} \le 3z_{M-5}. \]

□ 
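
The recursion and the lower bounds of Lemma B.3 can be compared numerically. The sketch below iterates $u_{k+1} = a\sqrt{u_k/\log(b/u_k)}$ from $u_1 = a$ and evaluates the bound $a^{S_{k-1}}/(15\log^{S_{k-2}/2}(b/a^{S_{k-2}}))$ for $k \le M-3$, with $a^{S_{M-5}}$ replacing $a^{S_{k-2}}$ inside the logarithm for larger $k$. The values of $a$, $b$ and $M$ are arbitrary choices satisfying $15a^{S_{M-2}} \le b$, and the function names are hypothetical.

```python
import math


def S(k):
    """S_k = 2 - 2**(-k) for k >= 0 and S_k = 0 for k < 0."""
    return 0.0 if k < 0 else 2.0 - 2.0 ** (-k)


def recursion(a, b, M):
    """Return [u_1, ..., u_M] with u_1 = a, u_{k+1} = a*sqrt(u_k / log(b / u_k))."""
    u = [a]
    for _ in range(M - 1):
        u.append(a * math.sqrt(u[-1] / math.log(b / u[-1])))
    return u


def lemma_bound(a, b, M, k):
    """Lower bound on u_k; the argument of the log changes for k >= M - 2."""
    ref = S(k - 2) if k <= M - 3 else S(M - 5)
    return a ** S(k - 1) / (15.0 * math.log(b / a ** ref) ** (S(k - 2) / 2.0))


if __name__ == "__main__":
    a, b, M = 50.0, 2.0e6, 5
    assert 15.0 * a ** S(M - 2) <= b     # condition of the lemma
    for k, u_k in enumerate(recursion(a, b, M), start=1):
        print(k, u_k, lemma_bound(a, b, M, k))
```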


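Lemma B.4. If $2^M \le \log(4T)/6$, the following specific choice 

\[ a := \Bigl[2T\log^{S_{M-3}/4}\bigl((2T)^{15/(2^M-1)}\bigr)\Bigr]^{1/S_{M-1}} \]

ensures that 

\[ a^{S_{M-1}} \ge \sqrt{\frac{15}{16e}}\; 2T\log^{S_{M-3}/4}\Bigl(\frac{2T}{a^{S_{M-5}}}\Bigr) \tag{9} \]

and 

\[ 15a^{S_{M-2}} \le 2T. \tag{10} \]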


Proof. Immediate for $M = 2$. For $M > 2$, $2^M \le \log(4T)$ implies 

\[ a^{S_{M-1}} = 2T\log^{S_{M-3}/4}\bigl((2T)^{15/(2^M-1)}\bigr) \ge 2T\Bigl[\frac{15}{2^M-1}\log(2T)\Bigr]^{1/4} \ge 2T. \]

Therefore, $a \ge (2T)^{1/S_{M-1}}$, which in turn implies that 

\[ a^{S_{M-1}} = 2T\log^{S_{M-3}/4}\bigl((2T)^{15/(2^M-1)}\bigr) \ge \sqrt{\frac{15}{16e}}\; 2T\log^{S_{M-3}/4}\Bigl(\frac{2T}{a^{S_{M-5}}}\Bigr). \]

This completes the proof of (9). Equation (10) follows if 

\[ 15^{S_{M-1}}(2T)^{S_{M-2}}\log^{S_{M-3}S_{M-2}/4}\bigl((2T)^{15/(2^M-1)}\bigr) \le (2T)^{S_{M-1}}. \tag{11} \]

Using that $S_{M-k} \le 2$, we get that the left-hand side of (11), divided by $(2T)^{S_{M-2}}$, is smaller than 

\[ 15^2\log\bigl((2T)^{15/(2^M-1)}\bigr) \le 2250\log\bigl((2T)^{2^{1-M}}\bigr). \]

The result follows using $2^M \le \log(2T)/6$, which implies that the right-hand 
side in the above inequality is bounded by $(2T)^{2^{1-M}}$. □ 


SUPPLEMENTARY MATERIAL 

Supplement to “Batched bandit problems” 

(DOI: 10.1214/15-AOS1381SUPP; .pdf). The supplementary material [29] 
contains additional simulations, including some using real data. 